

A Spectral Energy Distance for Parallel Speech Synthesis

Neural Information Processing Systems

Speech synthesis is an important practical generative modeling problem that has seen great progress over the last few years, with likelihood-based autoregressive neural models now outperforming traditional concatenative systems. A downside of such autoregressive models is that they require executing tens of thousands of sequential operations per second of generated audio, making them ill-suited for deployment on specialized deep learning hardware. Here, we propose a new learning method that allows us to train highly parallel models of speech, without requiring access to an analytical likelihood function. Our approach is based on a generalized energy distance between the distributions of the generated and real audio. This spectral energy distance is a proper scoring rule with respect to the distribution over magnitude-spectrograms of the generated waveform audio and offers statistical consistency guarantees. The distance can be calculated from minibatches without bias, and does not involve adversarial learning, yielding a stable and consistent method for training implicit generative models. Empirically, we achieve state-of-the-art generation quality among implicit generative models, as judged by the recently-proposed cFDSD metric. When combining our method with adversarial techniques, we also improve upon the recently-proposed GAN-TTS model in terms of Mean Opinion Score as judged by trained human evaluators.
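The generalized energy distance described above can be made concrete with a small sketch. This is an illustrative, single-scale simplification, not the paper's implementation: the function names (`mag_spectrogram`, `spectral_energy_distance`) are hypothetical, and the actual method uses multi-scale spectrogram losses and a specific choice of distance, whereas this sketch uses a plain Euclidean distance between single-scale magnitude spectrograms. The key structure it shows is the attractive terms pulling generated samples toward real audio and the repulsive term between two independent model samples, which is what makes the loss a proper scoring rule.

```python
import numpy as np

def mag_spectrogram(x, frame_len=256, hop=128):
    """Magnitude spectrogram via a simple framed FFT with a Hann window.

    Illustrative only: real systems would typically use an STFT from a
    signal-processing library at several window sizes.
    """
    win = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * win
              for i in range(0, len(x) - frame_len + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def spectral_energy_distance(real, gen_a, gen_b):
    """Minibatch-style estimate of a generalized energy distance over
    magnitude spectrograms.

    `gen_a` and `gen_b` are two independent model samples for the same
    conditioning signal. The cross-term between them is the "repulsive"
    term: it penalizes the model for collapsing to a single output and
    is what makes the objective a proper scoring rule.
    """
    d = lambda u, v: np.linalg.norm(mag_spectrogram(u) - mag_spectrogram(v))
    attract = d(real, gen_a) + d(real, gen_b)  # pull samples toward the data
    repulse = d(gen_a, gen_b)                  # keep generated samples diverse
    return attract - repulse
```

Because both attractive and repulsive terms are plain expectations over sample pairs, a minibatch estimate of this distance is unbiased, which is the property the abstract highlights as enabling stable, non-adversarial training.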





Review for NeurIPS paper: A Spectral Energy Distance for Parallel Speech Synthesis

Neural Information Processing Systems

Additional Feedback: Comments:
- Section 2: Flow-based models are not necessarily large. The new SOTA WaveFlow is a small-footprint flow-based model for raw audio. The authors may want to reference WaveFlow and correct this inaccurate claim in the related work section.
- I usually don't take FDSD-style measures seriously, as they cannot provide meaningful comparisons across different models in general, which the authors themselves also observe.
- It would be very nice to see an ablation study with MOS scores varying three design choices: 1) with or without the repulsive term, 2) single- vs. multi-scale spectrogram loss, 3) with or without the GAN loss. This would single out and emphasize the benefit of the repulsive term under different circumstances.


Review for NeurIPS paper: A Spectral Energy Distance for Parallel Speech Synthesis

Neural Information Processing Systems

This paper proposes a strategy for parallel TTS based on a spectral energy distance. It relies on neither explicit likelihood optimization nor adversarial learning, which enables more stable and consistent training. On top of that, the authors introduce a repulsive term which is shown to significantly improve the quality of the generated speech. When combined with adversarial training, the quality of the speech can be improved further. Overall, this is an interesting work, technically solid and experimentally compelling.




A Spectral Energy Distance for Parallel Speech Synthesis

Gritsenko, Alexey A., Salimans, Tim, Berg, Rianne van den, Snoek, Jasper, Kalchbrenner, Nal

arXiv.org Machine Learning
